graph : remove redundant GDN state transposes#20443
Conversation
|
This first, or the Vulkan implementation first? It's going to be stuck waiting for the CI forever if we have to keep adapting it before it has a chance to finish. Otherwise I'll have to merge it without waiting for the CI. |
|
@0cc4m Yes, proceed with the Vulkan implementation. Sorry for the rapid changes. |
|
This currently speeds up TG on NVGPU due to coalescing the writes (in TG we process n_token=1 and afterwards write the states back, whereas in PP we process ubatch tokens and write states back at the end). Uncoalesced reads are not so important with big caches, and I thus saw no thrashing occur (though on pre-Ada GPUs this may happen). Further PP improvements will have to revolve around batching data accesses (either process > 1 col per warp and do wider loads or unroll |
…20436) The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path. Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access
Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so users can control fused GDN independently of auto-detection. All GATED_DELTA_NET backend-ops tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags
- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent state layout mismatch between transposed (fused) and non-transposed (unfused) formats
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7ea6ee4 to fe3ef4a
|
We may consider removing the obsolete smem code in this PR (we unified HIP/CUDA paths in #20340 via #20391, and removed the SMEM as part of this). Can also file a separate PR if preferred:
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index eb3858820..1ce6d5f31 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -135,15 +135,6 @@ __global__ void gated_delta_net_cuda(const float * q,
}
}
-static size_t calculate_smem(const int sv, int cc)
-{
- size_t smem = 0;
- if ((GGML_CUDA_CC_IS_AMD(cc) && !GGML_CUDA_CC_IS_RDNA3(cc) && !GGML_CUDA_CC_IS_RDNA4(cc)) || GGML_CUDA_CC_IS_MTHREADS(cc)) {
- smem = sv * sv * sizeof(float);
- }
- return smem;
-}
-
template <bool KDA>
static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
@@ -180,18 +171,14 @@ static void launch_gated_delta_net(
sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);
break;
case 64: {
- constexpr int sv = 64;
- size_t smem = calculate_smem(sv, cc);
- gated_delta_net_cuda<sv, KDA><<<grid_dims, block_dims, smem, stream>>>(
+ gated_delta_net_cuda<64, KDA><<<grid_dims, block_dims, 0, stream>>>(
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);
break;
}
case 128: {
- constexpr int sv = 128;
- size_t smem = calculate_smem(sv, cc);
- gated_delta_net_cuda<sv, KDA><<<grid_dims, block_dims, smem, stream>>>(
+ gated_delta_net_cuda<128, KDA><<<grid_dims, block_dims, 0, stream>>>(
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);
|
|
This change made Qwen3.5 about 10% slower for tg in the Vulkan backend, with CPY getting 3x more expensive. We may be hitting some slow path; I'll look into it. |
|
I saw something like that, but thought it best to wait until we get some more shader improvements in. |
|
I don't think it was the copy shader, some driver issue was giving me bad timestamps. The timings I'm seeing now show the GDN op several times slower, but I'm not totally confident in the data. |
|
This PR resolved a tg regression on Strix Halo for Qwen MoE models using the ROCm backend. Thanks for the PR, and to the original reporter for the issue. It seems that gfx1151 is particularly sensitive to the changes being made for GDN support, perhaps due to susceptibility to register pressure. I'm wondering whether pointed testing for that architecture specifically is possible, given the apparent sensitivity. Fully acknowledging the combinatorial explosion of needing to test a wide variety of architectures, and the slippery slope, this is a pointed request specifically for GDN-related changes, given the apparent platform sensitivity and the relative popularity of the architecture. Ideally, GDN-related updates in llama.cpp should help performance across the board, but avoiding regressions would be a nice baseline :) Will defer to @IMbackK for insights on change management in relation to AMD architectures. Cheers. |




cont #20437
fix #20436
As correctly noted in #20436, there is no need to transpose the recurrent state in the GDN computation. Simplify the ggml graph for the unfused path and optimize the fused kernels using coalesced reads/writes.
TODOs: